read data from an hdf rather than a csv #29

rmudambi · 2023-04-04T01:31:22Z

Read data from HDF rather than CSV

Description

Category: feature
JIRA issue: MIC-3942

Read data in from HDF rather than CSV
Fix errors in incorrect_select_options.csv
Fix issue with categorical dtypes by using NA instead of "" for missing data
Cast column to str dtype in the typographic-error noise function.
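
As an illustration of the categorical dtype item above, here is a minimal sketch (not taken from the PR; the values are made up) of how an empty-string placeholder differs from a real missing value once a column is categorical. It assumes `np.nan` as the missing marker, which may not be exactly what the PR uses for "NA".

```python
import numpy as np
import pandas as pd

# Hypothetical values; "" stands in for missing data, as a CSV read with
# keep_default_na=False would produce.
with_empty = pd.Series(["WA", "", "OR"], dtype="category")

# Same data with a real missing value instead of "".
with_na = pd.Series(["WA", np.nan, "OR"], dtype="category")

# "" is treated as a legitimate category and is not reported as missing.
print(list(with_empty.cat.categories), int(with_empty.isna().sum()))

# The missing value is excluded from the categories and is reported as missing.
print(list(with_na.cat.categories), int(with_na.isna().sum()))
```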

Testing

Ran integration tests against sample data generated by the updated simulation, which outputs HDF files.
Ran automated test suite

stevebachmeier · 2023-04-04T19:33:13Z

src/pseudopeople/entity_types.py

            column.loc[to_noise_idx], configuration, randomness_stream, additional_key
        )
-
+        column.loc[to_noise_idx] = noised_data


Was this actually causing a problem or do you just find this more readable?

This just made debugging easier since I could put a breakpoint between the function call and the assignment to the series.

stevebachmeier · 2023-04-04T19:34:33Z

src/pseudopeople/interface.py

-    data = pd.read_csv(path, dtype=str, keep_default_na=False)
+    data = pd.read_hdf(path)
+    if not isinstance(data, pd.DataFrame):
+        raise TypeError(f"File located at {path} must contain a pandas DataFrame.")


Consider moving into a load_data utility function so that we don't forget to check type.
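
A minimal sketch of what such a utility might look like, assuming it lives somewhere importable by `interface.py`. The name `load_data` comes from the comment above; the signature and docstring are hypothetical, and the body simply wraps the two statements from the diff.

```python
from pathlib import Path
from typing import Union

import pandas as pd


def load_data(path: Union[str, Path]) -> pd.DataFrame:
    """Read an HDF file and ensure it holds a DataFrame (hypothetical helper)."""
    # pd.read_hdf can return other pandas objects (e.g. a Series),
    # so the type check is kept next to the read.
    data = pd.read_hdf(path)
    if not isinstance(data, pd.DataFrame):
        raise TypeError(f"File located at {path} must contain a pandas DataFrame.")
    return data
```

Callers would then use `data = load_data(path)` and get the type check for free. Note that `pd.read_hdf` needs the optional PyTables dependency installed.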

stevebachmeier · 2023-04-04T19:38:41Z

src/pseudopeople/noise_functions.py

@@ -303,6 +303,7 @@ def keyboard_corrupt(truth, corrupted_pr, addl_pr, rng):
    include_original_token_level = configuration.include_original_token_level

    rng = np.random.default_rng(seed=randomness_stream.seed)
+    column = column.astype(str)


This will convert any NaNs to "nan" and proceed to corrupt that. We shouldn't have any NaNs at this point though, right? B/c those get dropped up front when this gets called?

Yes, that's correct, but definitely great to call this out
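
The concern in this thread can be reproduced with a short sketch. The column values are made up, and the `dropna` call only stands in for whatever up-front filtering the calling code does; it is an assumption, not the project's actual logic.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry, stored as plain Python objects.
column = pd.Series(["Seattle", np.nan, "Tacoma"], dtype=object)

# Casting everything: in pandas 1.x/2.x the missing entry becomes the text "nan",
# which a noise function would then corrupt like any other string.
cast_all = column.astype(str)

# Dropping missing entries first means only real strings reach the noise step.
to_noise = column.dropna().astype(str)
print(to_noise.tolist())
```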

read data from an hdf rather than a csv

595dec6

rmudambi requested review from albrja, hussain-jafari, mattkappel, ramittal and stevebachmeier as code owners April 4, 2023 01:31

stevebachmeier reviewed Apr 4, 2023

stevebachmeier approved these changes Apr 4, 2023

merge

0dd3362

albrja approved these changes Apr 5, 2023

rmudambi added 3 commits April 4, 2023 17:30

Merge branch 'develop' into feature/read-hdf

2f91586

fix bug introduced during merge

331b67b

formatting

8a3dafd

rmudambi merged commit 0c4290a into develop Apr 5, 2023

rmudambi deleted the feature/read-hdf branch April 5, 2023 00:39
